In this project, I explore and analyze red wine quality. I will use several data analysis techniques in R to find insights in one or more variables.
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
library(dplyr)
library(tidyr)
# Load the Data
dt <- read.csv('wineQualityReds.csv')
Let’s take a look at some summary statistics on the dataset first.
#summary statistic
str(dt)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(dt)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The dataset has 13 variables and 1599 rows. The variable X appears to be a row index. Quality is the “Y” variable we are interested in; the remaining variables, except X, are the “X” variables whose influence on quality we are going to analyze. Quality ranges from 3 to 8, with a mean of 5.6 and a median of 6.
Then I looked at the distribution of all 12 variables.
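A minimal sketch of how one histogram per variable could be drawn in a single faceted figure. The helper `plot_histograms()` is a hypothetical name, and `toy` holds only the first five rows of three columns from the `str()` output above; in the analysis the helper would be called on `dt`.

```r
library(ggplot2)
library(tidyr)

# Hypothetical helper: one histogram per column, with the index column X dropped.
plot_histograms <- function(df, bins = 30) {
  df <- df[, setdiff(names(df), "X"), drop = FALSE]
  # Reshape to long form so that facet_wrap() can split by variable name.
  long <- pivot_longer(df, cols = everything(),
                       names_to = "variable", values_to = "value")
  ggplot(long, aes(x = value)) +
    geom_histogram(bins = bins) +
    facet_wrap(~ variable, scales = "free")
}

# Toy subset for illustration only (values copied from the str() output).
toy <- data.frame(
  X = 1:5,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51),
  quality = c(5, 5, 5, 6, 5)
)
p <- plot_histograms(toy)
```

Free scales are used here because the variables live on very different ranges (for example density near 1 and total sulfur dioxide up to 289).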
From the histograms, we can see that most variables are right-skewed, while pH and density are approximately normally distributed. Among the right-skewed variables, residual sugar and chlorides have particularly long tails.
Then, let’s take a look at boxplots.
From the boxplots, we can see that most variables have outliers, especially residual sugar and chlorides. We’ll decide later in the analysis whether we need to remove them.
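To put a number on what the boxplots show, the points drawn beyond the whiskers can be counted with the 1.5 × IQR rule. This is only a sketch: `count_outliers()` is a hypothetical helper, and the input is the first ten residual.sugar readings from the `str()` output, not the full column.

```r
# Hypothetical helper: count values lying more than 1.5 * IQR beyond the quartiles.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy subset: first ten residual.sugar values from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
n_out <- count_outliers(residual_sugar)
```

Applying the same helper to each column of `dt` (for example with `sapply()`) would give a per-variable outlier count to support the decision about removal.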
The dataset has shape (1599, 13): 1599 wine records and 13 variables, with X being the index of the dataset.
Of the 12 wine variables, the first 11 are physicochemical measurements of the wine samples; the 12th, quality, is a score on a 10-point scale based on sensory data from at least three wine experts.
The main feature of interest is quality. From the univariate plots, we see that it is nearly normally distributed, with most observations in the 5-6 range.
Although all variables could potentially affect wine quality, some high-level research suggests that acidity is a major factor. So I think fixed acidity, volatile acidity, citric acid and pH are significant.
I created two variables, quality_score and acidity. Quality score groups quality into three buckets: poor, mid and good. Because most wines are rated 5 or 6, I treat 5 and 6 as mid; anything below 5 is poor and anything above 6 is good. Acidity is the sum of fixed acidity, volatile acidity and citric acid, which could be a significant feature.
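The two derived variables could be built roughly as follows. This is a sketch under assumptions: `add_derived()` is a hypothetical helper, and storing quality_score as an ordered factor is one possible encoding, since the original code is not shown. The toy rows are rows 1, 4 and 8 of the `str()` output.

```r
# Hypothetical helper: add the two derived columns described above.
add_derived <- function(df) {
  # quality below 5 -> poor, 5 or 6 -> mid, above 6 -> good
  df$quality_score <- cut(df$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("poor", "mid", "good"),
                          ordered_result = TRUE)
  # acidity as the plain sum of the three acid measurements
  df$acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
  df
}

# Toy subset for illustration only (values copied from the str() output).
toy <- data.frame(
  fixed.acidity = c(7.4, 11.2, 7.3),
  volatile.acidity = c(0.70, 0.28, 0.65),
  citric.acid = c(0.00, 0.56, 0.00),
  quality = c(5, 6, 7)
)
out <- add_derived(toy)
```

With `cut()` and right-closed intervals, the breaks at 4 and 6 put qualities 3 and 4 in poor, 5 and 6 in mid, and 7 and 8 in good.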
The distribution of citric acid is unusual compared with fixed acidity and volatile acidity. The latter two are roughly bell shaped, but citric acid looks more like an exponential distribution. Citric acid also has a large number of zero values, which could reflect incomplete or unavailable data.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 132 rows containing non-finite values (stat_bin).
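The warnings above are what a log-scaled axis produces when a variable contains zeros: `log10(0)` is `-Inf`, and the histogram drops those rows. A small sketch on the first ten citric.acid values from the `str()` output (a toy subset, not the full column):

```r
# Toy subset: first ten citric.acid values from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero <- sum(citric == 0)             # zeros in the raw values
logged <- log10(citric)                # log10(0) evaluates to -Inf
n_dropped <- sum(!is.finite(logged))   # rows a log-scale histogram would drop
```

Running the same count on `dt$citric.acid` is a quick way to check whether the number of zeros matches the number of rows reported as removed.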
In general, the dataset is tidy and no further cleaning is needed.
For the bivariate analysis, I’ll start by creating 11 box plots to find relationships between quality and each feature. I use quality score instead of quality because it has fewer groups, which makes the relationships easier to see in the plots.
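One of these box plots could be produced along the following lines. `plot_by_score()` is a hypothetical helper, and the toy data are the first ten alcohol and quality values from the `str()` output, so only the mid and good groups appear here.

```r
library(ggplot2)

# Hypothetical helper: boxplot of one feature across the quality_score groups.
plot_by_score <- function(df, feature) {
  ggplot(df, aes(x = quality_score, y = .data[[feature]])) +
    geom_boxplot() +
    labs(x = "quality score", y = feature)
}

# Toy subset for illustration only (values copied from the str() output).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy$quality_score <- cut(toy$quality, breaks = c(-Inf, 4, 6, Inf),
                         labels = c("poor", "mid", "good"))
p <- plot_by_score(toy, "alcohol")
```

Passing the feature name as a string makes it easy to loop over all feature columns and arrange the resulting plots in a grid, for instance with `gridExtra::grid.arrange()`.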
From the boxplots above, we can see that fixed acidity, volatile acidity and citric acid are all related to quality score. Fixed acidity and citric acid have a positive relationship, while volatile acidity has a negative one. The derived acidity variable has only a slightly positive relationship, so I don’t think it is better than the three separate variables. Sulphates and alcohol are also positively related to wine quality.
To check what we saw in the plots, I calculated the correlation between quality and each variable.
## X fixed.acidity volatile.acidity
## 0.06645261 0.12405165 -0.39055778
## citric.acid residual.sugar chlorides
## 0.22637251 0.01373164 -0.12890656
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH sulphates alcohol
## -0.05773139 0.25139708 0.47616632
## quality acidity quality_score
## 1.00000000 0.10375373 0.81236704
We can see that:
1. Alcohol has the strongest correlation with quality, followed by volatile acidity.
2. Fixed acidity and citric acid have a positive correlation, while volatile acidity has a negative one.
3. Free sulfur dioxide and pH have very low correlations (about -0.05).
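A table like the one above can be computed by correlating every numeric column with quality and sorting by absolute strength. This is a sketch only: `cor_with()` is a hypothetical helper, and the toy data are the first ten rows of four columns from the `str()` output, so the resulting numbers will not match the full-data values.

```r
# Hypothetical helper: Pearson correlation of each numeric column with `target`,
# sorted from strongest to weakest by absolute value.
cor_with <- function(df, target = "quality") {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  others <- num[, setdiff(names(num), target), drop = FALSE]
  r <- sapply(others, function(col) cor(col, num[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Toy subset for illustration only (values copied from the str() output).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
r <- cor_with(toy)
```

Sorting by absolute value keeps strong negative correlations, such as volatile acidity, near the top instead of at the bottom of the list.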
As noted above, fixed acidity and citric acid relate positively to quality score and volatile acidity relates negatively, while sulphates and alcohol are also positively related to wine quality. These findings can be checked with a correlation test.
## cor
## -0.6829782
## cor
## 0.2349373
## cor
## -0.5419041
We can see that for fixed acidity and citric acid the relationship is negative, while for volatile acidity it is slightly positive.
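The `cor` label in the output above is the name `cor.test()` gives to its `estimate` element. The sketch below shows that shape on toy values; pairing fixed acidity with pH is an assumption made for illustration, because the variables behind the three coefficients are not stated, and the ten values are only the first rows from the `str()` output.

```r
# Toy subset: first ten fixed.acidity and pH values from the str() output above.
# The choice of pH as the second variable is an assumption for illustration.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

res <- cor.test(fixed_acidity, ph, method = "pearson")
est <- res$estimate   # named numeric; the name is "cor"
pval <- res$p.value   # p-value for the null hypothesis of zero correlation
```

Unlike `cor()`, `cor.test()` also returns a p-value and a confidence interval, which helps judge whether a weak coefficient is distinguishable from zero.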
Top 5 features most correlated with quality (correlation coefficient):
1. alcohol: 0.476
2. volatile acidity: -0.391
3. sulphates: 0.251
4. citric acid: 0.226
5. total.sulfur.dioxide: -0.185
In this multivariate plot section, I’ll analyze whether there are any interactions between the five features above.
For alcohol and volatile.acidity, we can see a clear separation between poor wine (high volatile acidity and low alcohol content) and good wine (low volatile acidity and high alcohol content).
For alcohol and citric acid, I didn’t find a clear interaction.
For alcohol and sulphates, we found a clear separation between poor wine (low sulphates and low alcohol content) and good wine (high sulphates and high alcohol content).
For alcohol and total sulfur dioxide, I didn’t find a clear interaction.
For citric.acid and volatile.acidity, the separation is not very clear, but we can still see a difference between poor wine (low citric.acid and high volatile.acidity) and good wine (high citric.acid and low volatile.acidity).
For total.sulfur.dioxide and volatile.acidity, I didn’t find a clear pattern of difference.
From the multivariate plots, I found that alcohol with volatile.acidity and alcohol with sulphates show a strong interaction effect. Citric.acid with volatile.acidity shows some interaction, but it is not very strong. Alcohol with citric acid, alcohol with total sulfur dioxide, and total.sulfur.dioxide with volatile.acidity show no interaction effect: although each of these features is among those most correlated with quality, they do not strengthen each other through interaction.
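The pairwise comparisons above rely on scatter plots of two features coloured by quality score. A minimal sketch of such a plot, with a hypothetical helper `plot_pair()` and toy data taken from the first ten rows of the `str()` output (so no poor wines appear here):

```r
library(ggplot2)

# Hypothetical helper: scatter plot of two features, coloured by quality_score.
plot_pair <- function(df, x, y) {
  ggplot(df, aes(x = .data[[x]], y = .data[[y]], colour = quality_score)) +
    geom_point(alpha = 0.5) +
    labs(x = x, y = y, colour = "quality score")
}

# Toy subset for illustration only (values copied from the str() output).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy$quality_score <- cut(toy$quality, breaks = c(-Inf, 4, 6, Inf),
                         labels = c("poor", "mid", "good"))
p <- plot_pair(toy, "alcohol", "volatile.acidity")
```

On the full data, filtering out the mid group before plotting can make the contrast between poor and good wines easier to see, since mid wines dominate the sample.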
One interesting interaction is between citric.acid and volatile.acidity. They show some degree of interaction, whereas for alcohol and citric acid we could not find one, even though alcohol is a more significant feature than volatile.acidity.
From the final plots below, we can see that low volatile acidity, high alcohol and high sulphates are associated with good wines.
This plot tells us that good wines have more alcohol and more sulphates. Because there is a clear separation between poor and good wines in the plot, I’d say that alcohol and sulphates are two important factors for wine quality.
This plot tells us that good wines have more alcohol and less volatile acidity. Because there is a clear separation between poor and good wines in the plot, I’d say that alcohol and volatile acidity are two important factors for wine quality.
This plot tells us that good wines have more alcohol and more citric acid. I chose this plot not only because I found a clear difference in it, but also because the impact of citric acid is opposite to that of volatile acidity (citric acid is positive, volatile acidity is negative), which is interesting.
In conclusion, these three scatter plots tell us that good wines have more alcohol, more sulphates, more citric acid and less volatile acidity. Note that citric acid and volatile acidity have opposite effects on wine quality.
Exploratory data analysis proved to be very effective for understanding relationships within the red wine quality dataset. The project shows a systematic way of analyzing and visualizing a dataset. It starts with univariate analysis, to understand the dataset and the distribution of each variable. Although this step added little in this project, it could be very helpful if there were data quality issues in the dataset. The bivariate analysis then starts to bring out insights, helping to find the features most strongly related to wine quality. I found the top 5 features most correlated with wine quality (correlation coefficient):
1. alcohol: 0.476
2. volatile acidity: -0.391
3. sulphates: 0.251
4. citric acid: 0.226
5. total.sulfur.dioxide: -0.185
This led into the next step, the multivariate analysis. This third step is the most important one, as it helps find the key insights of the dataset by revealing the interaction effects between variables. I found that alcohol with volatile.acidity and alcohol with sulphates show strong interaction effects. Finally, from the final analysis, I concluded that good wines have more alcohol, more sulphates and less volatile acidity.